
Single Point of Failure in Cloud Computing

The most talked-about outage of last October 5 makes us reflect on how much a company or business needs more than one "way out" in an emergency.
The blackout in question is, of course, the one that affected the Facebook platform, keeping millions of users in suspense for more than six hours. What sent not only the social network but the entire Facebook company into a tailspin was an error that triggered a cascade of system interruptions, blocking users' access to the platform and leaving employees idle and unable to work. The question we would like to focus on is: how dependent can a company afford to be on a single point of failure?

What is a Single Point of Failure?

Let’s start by defining a Single Point of Failure.
"A SPOF, or single point of failure, is any non-redundant part of a system that, if it fails, can cause the entire system to fail. A single point of failure is antithetical to the goal of high availability in a computer system or network, software application, business practice, or any other industrial system."
The question, then, arises naturally: how can this be avoided?

Eliminating SPOFs in Cloud Computing

Redundancy, both logical and physical, and high-availability clusters are the key factors in avoiding SPOFs. High-availability clusters minimize disruptions to the system components hosted in the cloud, typically targeting 99.99% availability.
Physical redundancy complements high-availability clustering: no hardware or software component should ever depend on a single piece of hardware. Servers should be made highly available in the cloud by strengthening the physical architecture with additional routers and switches, and the data center architecture should provide redundant network paths, so that communication between the cloud and the system components never depends on a single route.
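To make the idea concrete at the application level, here is a minimal Python sketch of client-side failover between redundant endpoints. The URLs are hypothetical placeholders, not real services; the point is simply that when one replica fails, the request falls through to the next, so no single endpoint is a SPOF.

```python
import urllib.request
import urllib.error

# Hypothetical endpoints: a primary and a redundant replica.
# Neither URL is real; substitute your own service addresses.
ENDPOINTS = [
    "https://primary.example.com/health",
    "https://replica.example.com/health",
]

def fetch_with_failover(endpoints, timeout=3):
    """Try each redundant endpoint in order; return the first response.

    If every endpoint fails, the whole call fails -- which is exactly
    what happens when a system has a single point of failure and no
    replica to fall back on.
    """
    last_error = None
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError) as exc:
            last_error = exc  # remember the failure and try the next replica
    raise RuntimeError(f"All endpoints failed; last error: {last_error}")

if __name__ == "__main__":
    print(fetch_with_failover(ENDPOINTS))
```

The same principle applies one layer down: redundant routers, switches, and network paths play the role of the second URL in this sketch.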

What is a Disaster Recovery Plan and what are its benefits?

The sudden interruption of a set of services that all depend on a single configuration should make us reflect on how important it has become to always have an emergency plan that can be put into action in sudden and, above all, unpredictable situations. Just think of natural catastrophes such as earthquakes, tidal waves, or fires, which can strike a business and leave it with no way out.

A Disaster Recovery Plan is the process of preparing for the recovery and continuity of a company's vital services after a natural disaster or human error. It consists of a set of phases, including:

Testing: After installing the DR solution, you need to test it. A "game day" is a scheduled exercise in which you perform a full failover to your DR environment.
Monitoring and Alerting: You need regular audits and sufficient monitoring to alert you when your DR environment is affected by server failures, connectivity problems, or application issues.
Backups: Once you have implemented your DR environment, you should continue to perform regular backups. Periodic backup and recovery testing is essential as a fallback solution (see the first sketch after this list).
User Access: You can secure access to resources in your DR environment using AWS Identity and Access Management (IAM), as in the second sketch after this list.
Automation: You can automate the deployment of applications to AWS-based servers and on-premises servers using configuration management software.
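As a concrete illustration of the Backups phase, here is a minimal sketch using boto3, the AWS SDK for Python. The volume ID and region are placeholders, not values from any real environment; the sketch simply shows one way a scheduled job could create a point-in-time EBS snapshot as part of a DR routine.

```python
import boto3

# Placeholder region; use the region where your volumes live.
ec2 = boto3.client("ec2", region_name="eu-west-1")

def backup_volume(volume_id):
    """Create a point-in-time EBS snapshot of the given volume."""
    snapshot = ec2.create_snapshot(
        VolumeId=volume_id,
        Description="Scheduled DR backup",
    )
    return snapshot["SnapshotId"]

# Hypothetical volume ID, for illustration only.
snapshot_id = backup_volume("vol-0123456789abcdef0")
print(f"Started snapshot {snapshot_id}")
```

In practice a job like this would run on a schedule, and the resulting snapshots would be part of the periodic restore tests described above.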
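And for the User Access phase, a minimal sketch of creating a restrictive IAM policy with boto3. The policy name, actions, and scope here are illustrative assumptions; a real DR policy would be scoped to your own resources and roles.

```python
import json
import boto3

iam = boto3.client("iam")

# Hypothetical policy: allow a DR team to inspect snapshots and
# restore them to new volumes, and nothing else.
dr_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeSnapshots",
                "ec2:CreateVolume",
            ],
            "Resource": "*",
        }
    ],
}

response = iam.create_policy(
    PolicyName="dr-restore-only",  # illustrative name
    PolicyDocument=json.dumps(dr_policy),
)
print(response["Policy"]["Arn"])
```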

Are you interested in a Disaster Recovery plan? Explore VMEngine Solutions

Author

Maria Grazia
